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Abstract 

In the rapidly evolving domain of next generation sequencing and 
bioinformatics analysis, data generation is one aspect that is increasing at 
a concomitant rate. The burden associated with processing large amounts 
of sequencing data has emphasised the need to allocate sufficient com¬ 
puting resources to complete analyses in the shortest possible time with 
manageable and predictable costs. A novel method for predicting time to 
completion for a popular bioinformatics software (QIIME), was developed 
using key variables characteristic of the input data assumed to impact pro¬ 
cessing time. Multiple Linear Regression models were developed to deter¬ 
mine run time for two denoising algorithms and a general bioinformatics 
pipeline. The models were able to accurately predict clock time for denois¬ 
ing sequences from a naturally assembled community dataset, but not an 
artificial community. Speedup and efficiency tests for AmpliconNoise also 
highlighted that caution was needed when allocating resources for parallel 
processing of data. Accurate modelling of computational processing time 
using easily measurable predictors can assist NGS analysts in determining 
resource requirements for bioinformatics software and pipelines. Whilst 
demonstrated on a specific group of scripts, the methodology can be ex¬ 
tended to encompass other packages running on multiple architectures, 
either in parallel or sequentially. 

Keywords — Computational performance, bioinformatics pipelines, Multiple 
Linear Regression modelling 
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1 Introduction 


Next-Generation Sequencing Analysis 

The rapid increase in the utilisation of Next Generation Sequencing (NGS) tech¬ 
nologies amongst disparate fields of research has resulted in a new set of chal¬ 
lenges for scientific researchers. Whilst technology selection, sequencing costs 
and efficacy of sample preparation are still considerable concerns, pragmatic 
approaches in decision-making and implementation will result in satisfactory 
sequence generation. However, potentially vast quantity of data generated by 
sequencing efforts suggests that the true bottleneck is the computational anal¬ 
ysis of the sequence data 17, 3l]. Indeed, without considered planning of the 
bioinformatics analysis, data processing time and costs can far exceed those 
of the actual sequencing itself 3], diminishing the benefits attributed to high- 
throughput technologies. 

Whilst the utilisation of computing resources and data analysis tools is gen¬ 
erally accessible to the scientific and research communities, the ability to harness 
their full potential is often limited to specialists. Faced with the increasing ubiq¬ 
uity of bioinformatics tools for post-sequencing analysis and the realisation that 
decentralised research is becoming more prevalent, bioinformaticians are tasked 
with creating applications that are reliable, scalable, and user-friendly. 


The prevalence of bioinformatics workflows and pipelines such as Galaxy 10 
QIIME core analysis [5] and Taverna [ 14 , and the increasing collaborative efforts 


through shared infrastructures 28 suggests greater uptake by non-specialists 


will be forthcoming. As NGS technology develops, the challenges faced by both 
bioinformaticians and users relate specifically to the competency of the software 
tools and the performance of the hardware to handle increasingly larger and 
more complex datasets. 

The lack of appropriate hardware infrastructure is the greatest contributing 
factor to the bioinformatics bottleneck and the rise in virtual environments, par¬ 
allelised code and super-computing facilities is testament to an understanding 
of the need for continual development and innovation in NGS data handling 
and management 30 . However, these structural and programmatic facilitators 


are not without their drawbacks. For example, cloud computing facilities such 
as Amazon Web Services’ Elastic Compute Cloud [I] offer flexible and scalable 
environments for performing a wide range of bioinformatics, but issues around 
data security and file transfer rates coupled with per hour usage costs make 
resource planning an integral requirement for any data analysis project. 


1.1 Understanding capacity and performance 

The capacity for processing and analysing NGS data accurately is dependent 
on identifying the most suitable software and hardware for the task. Incor¬ 
rect selection or mismatch between data requirements and architecture will in¬ 
evitably lead to suboptimal performance and, potentially, poor or erroneous 
results. With a wide range of bioinformatics tools being utilised by both spe- 
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cialists and non-specialists, there is a need for greater transparency in their 
deployment to facilitate effective and efficient analysis. As time and cost are 
often constraining factors in research and corporate environments, the ability 
to assign resources with a priori knowledge of the performance and run time is 
of great benefit. For example, Cunningham was developed to provide accurate 


runtime estimates for BLAST analysis of large shotgun sequence datasets 34 


A recent study using the 16S gene for estimating bacterial diversity has 
shown the quantity and size of sequence clusters affects accuracy in non-parametric 
diversity calculations, whilst also determining which methods to employ [4]. 

Parallelisation of bioinformatics algorithms aimed at dramatically decreasing 
their processing time by exploiting multiple core processors [§] or GPU capabil¬ 
ities 18j has alleviated some of the analysis bottleneck. Code optimisation can 
also make significant performance gains in highly parallel applications 32 , but 


often requires expertise in coding that is not always accessible or practical for 
the end-user. 

Parallel speedup and efficiency are key performance metrics that can be used 
to assess the most effective use of multiple CPU cores or nodes in a Cluster, Grid 
or Cloud environment. Whilst it may be intuitive that splitting large compu¬ 
tational jobs amongst a greater number of processors should lead to increasing 
reductions in processing time, the presence of code that must be run serially 
(ser) in most algorithms means that only a fraction of the work benefits from 
parallel (par) speedup. Amdahl’s Law [2] describes the speedup of a process 
across multiple cores ( P ) given an amount of work (N) as: 


S(N, P) = 


t(N,P= 1) 


( 1 ) 


t(N,P) 

In ideal parallel processes, speedup is therefore equal to 1/P, but with a 
fraction of code being serial, this equation becomes: 


S(N, P) = 


t(N) 

ser + W 

t(N) s , 


iU) p 


( 2 ) 


Knowing the speedup of parallelisation, then the efficiency may also be cal¬ 
culated simply by: 


E = 


S(N,P) 

P 


(3) 


For algorithms that are parallelisable, it is useful to perform these calcula¬ 
tions to get an understanding of the scalability of the processes on any given 
architecture. This will help users to more appropriately assign resources and 
avoid problems with latency or parallel overheads. 


1.2 Computational transparency for targeted analysis 

Recent advances in sequencing technology have brought about unprecedented 
resolution in identification and classification of bacterial species. Sequencing of 
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the highly conserved 16S rRNA gene is popular as it allows for comparative 
studies of microbial communities, their diversity and structure. The ubiquity 
of this approach in amplicon-based metagenomics coupled with the dwindling 
cost of high-throughput sequencing has put emphasis on development of tools 
and hardware infrastructure that can handle increasingly data rich sequence 
analysis. 

A discretised pipeline was developed to model the relationship between se¬ 
quence data size and complexity, and computational resource. An overview of 
the pipeline components is shown in Fig. |A1| (Appendix |A|), comprising several 
typical processing and analysis protocols available in the QIIME software. The 
performance was measured using the time taken to complete each process step 
in real and CPU metrics. The clock or real time is necessary to determine actual 
resource cost, but can be skewed on systems where other extraneous processes 
are running, adding load to the shared resource. CPU time, characterised as 
the sum of user and system time, is reflective of the actual work done by the 
process being monitored. 


2 Methods 

2.1 System architecture 

The system used for evaluation was a 64-bit 2 x 6-Core (Intel Xeon 2.66 GHz 
CPU) Apple MacPro with 32GB RAM running OS X 10.8.2. 

For the performance testing 16 logical cores (8 physical cores with hyper¬ 
threading) were used for the parallelisable components of the pipeline (e.g. De- 
noising) and single CPU otherwise. All analysis results are specific to this 
architecture and configuration. All cited instances of QIIME relate to its OSX 
compilation, MacQIIME Version 1.6.0. 


2.2 Training and validation datasets 

2.2.1 Training data 


Three datasets containing 16S microbial rRNA gene fragments were used to 
develop the performance models. The first training dataset (MFC) was taken 
from an acetate fed microbial fuel cell reactor inoculated with arctic soil sourced 
from Arctic soil (Ny-Alesund, Spitsbergen, Svalbard), and operated at 26.5°C 
12 . The second dataset was generated from a sample taken from a small 
eutrophic lake in the English Lake District (Priest Pot) in 2008 [26 . The 
third dataset ( Arctic) were sequences derived from DNA extracted from Arctic 
mineral soil samples collected from the Svalbard region [Unpublished]. 

The Priest Pot sequences were generated using standard 454 GS-FLX chem¬ 
istry and targeted the V5 hyper-variable region, whilst the MFC and Arctic 
samples were sequenced with the more recent GS-FLX Titanium chemistry, 
which gives longer read lengths and targeted the V4-V5 regions, as at the time 
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of sequencing (2011), they provided the highest classification accuracy with low¬ 
est amplification bias. 

In the case of the MFC and Arctic data, samples were originally pooled 
using barcodes to provide a multiplexed dataset. However, only one sample was 
selected and processed through the pipeline to avoid the effects of redundancy 
when processing multiple samples as a single batch. 

2.2.2 Validation data 

The first validation dataset (Mix) were sequences taken from a laboratory scale 
batch reactor sample, which had been used to study anaerobic digestion of do¬ 
mestic wastewater at 15°C. The sample was sequenced in January 2013 using 
the same method employed for the Arctic data. The second validation dataset 
(. Artificial ) consisted of an artificial community from 90 clones that was py- 
rosequenced over the V5 region of the 16S rRNA gene with a 454 GS-FLX 
sequence r [26] . 

Table [l] summarises each of the datasets used in this analysis and includes 
the number of reads, average read length, number of putative OTUs and a 
diversity (equitability) of the raw, unfiltered sequences. Because of inherent 
sequencing errors, the OTU and equitability values are likely to be over- and 
underestimated, respectively. As can be seen, the samples sequenced with the 
older 454 GS-FLX chemistry are much shorter than those sequenced with the 
Titanium chemistry. After trimming to remove primer and barcode, the mean 
read lengths are approximately 200 bp for GS-FLX and 400 bp for Titanium. 


Table 1: Test and Validation Datasets: Number of Sequences (Seqs); Aver¬ 
age Read Length (bp); Number of OTUs at 97% Cutoff (OTU); Equitability 
Estimate 15 of Total Samples (a) 


Source 

Dataset 

Seqs 

bp 

OTU 

a 

MFC 

Test 

72003 

411 ± 45 

828 

0.481 

Priest Pot 

Test 

28361 

244 ± 52 

1146 

0.613 

Arctic 

Test 

21576 

426 ± 53 

2267 

0.807 

Mixed sediment 

Validation 

19718 

423 ± 49 

2390 

0.801 

Artificial 

Validation 

46341 

260 ± 38 

177 

0.536 


2.3 Analysis steps 

An initial subsampling of the test datasets was performed using the QIIME 
subsample-fasta.py script to randomly split the raw Fasta file into subsamples 
of 5% to 95% fractions at intervals of 5%. Six repeats were generated at each 
interval for the GS-FLX sequenced Artificial dataset and two repeats for the 
Titanium sequenced Arctic dataset (due to the greatly increased computational 
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time required to denoise these sequences). Thus, for the denoising steps, 76 
samples were used for training and 133 samples for validation. 

A simple bash script was written to simplify and automate the performance 
testing by looping through the subsamples, processing according to each step 
in the performance pipeline (Fig. Al) and passing the Real and CPU time for 
execution to a separate output file. 

The subsampled Fasta files are pre-processed using the QIIME splitJibraries.py 
script, which applies some basic quality filtering to the sequences for read length 
trimming, ambiguous base checking and primer and barcode removal. 

The subsamples are then ready for denoising to correct for errors generated 
in the PCR and sequencing steps. QIIME Denoiser 29 is a heuristic algorithm 
that uses a greedy alignment scheme before clustering flowgrams in descending 
order of abundance. Erroneous reads are filtered from the cluster to produce the 
final denoised sequences. Denoiser can use pre-filtered Fasta files and matches 
the IDs of the remaining reads with those present in the text translation of the 
raw SFF file, to avoid denoising of poor quality sequences. Chimera checking is 
an optional but often important step that is performed independently from de¬ 
noising using the ChimeraSlayer 11 tool via QIIME’s identify-chimericseqs.py 


script. 

For denoising using AmpliconNoise, the Standard Flowgram File (SFF) as¬ 
sociated with the data was split into 19 subsamples with sizes corresponding to 
the set used for QIIME denoising using SFF Workbench 13 via a Wine [35] 


translation of the Windows API. The individual SFF files were then converted 
to text translations using the QIIME processsff.py script. There is no need for 
demultiplexing of the data as only a single sample is used in the pipeline test. 
The AmpliconNoise software [26] uses Bayesian theory to generate an approxi¬ 
mate likelihood from empirical error distribution data to infer true read identity 
given sequencing error (PyroNoise) and PCR error (SeqNoise). An additional 
chimera checking step using Perseus [27 is performed after error removal. The 
software was run using the QIIME wrapper script ampliconnoise.py rather than 
via the stand-alone package to maintain a consolidated workflow. 

The denoised reads from QIIME denoiser were used for downstream analysis 
using the following steps: 


• De Novo OTU picking: Clustering of sequences with 97% similarity 
threshold using the uclust method [F| , before picking representative Oper¬ 
ational Taxonomic Units (OTUs) from each cluster by the sequence first 
assigned to a cluster. 


Assign Taxonomy: Assignment of taxonomic identities to the OTUs 
using the curated Greengenes 16S rRNA gene database 
Bayesian classifier, RDP 33 . 


21 with a naive 


• Alignment: Alignment of the sequences is necessary for comparative 
analysis, such as /3 diversity, in which phylogenetic distances are used to 
understand differences in community composition from distinct samples. 
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Alignment was performed using the PyNAST method and pairwise clus¬ 
tering with uclust [7] against a Greengenes database template. Gap only 
columns and highly variable regions within the alignment files are removed 
using a filtering step that makes use of Greengenes compatible Lanemask 
file for excluding these positions. Phylogenic relatedness of organisms 
within a sample may also be of interest and the creation of a phylogenetic 
tree using the Fast Tree 2.1.3 22 method is performed after alignment. 


• Diversity analysis: Diversity metrics are key outputs from the QIIME 
pipeline and are used to gain an quantitative understanding of the dis¬ 
tribution and relatedness of organisms within a sample (a diversity) or 
between different samples (/3 diversity), a diversity uses the abundance 
data stored in the OTU table generated from the OTU picking and tax¬ 
onomy assignment steps to calculate a range of metrics provided by the 
user, such as Chaol, Shannon and Phylogenetic Distance. Rarefaction 
plots are generated for each metric based on random subsampling (using 
a pseudo random number generator) of the OTU table between a given 
range of sequences per sample and at a given step size. (3 diversity uses 
both the information stored in the OTU table and a phylogenetic tree, 
if the phylogenetic metrics are calculated using Unifrac. In this analysis, 
both quantitative (weighted) and qualitative (unweighted) Unifrac met¬ 
rics 19 were calculated and Principal Coordinate Analysis plots generated 


to display the results. 


Details of the parameters used for each analysis step are provided in TablefAl] 
(Appendix [A| . 


2.4 Performance measures 

Both the wall clock and CPU (Usr + Sys) were recorded using the GNU time 
command, which allows for formatting of the output in Mac OSX, and stored 
as text files. Whilst the wall clock time indicates the amount of real time 
between execution and completion of a process, CPU time is more indicative of 
the computing effort required to run the process. However, when considering 
running large datasets through a bioinformatics pipeline, time to completion is 
the measure by which costs can be assessed. 

File input size was determined based on the process step being analysed as 
shown in Tablc [A2| ( Appendix[A| . The number of reads and mean read length per 
Fasta file were determined using the QIIME countseqs.py script. Equitability 
was calculated using the QIIME a diversity metric script on the raw input fasta 
files supplied to the denoising algorithms. 

The time and predictor (diversity, number of reads and read length) data 
was imported into Matlab |20] and a stepwise Multiple Linear Regression (MLR) 
was applied to fit a model between the explanatory variables and clock time. 
The MLR model takes the following generalised form: 

Hi = P 0 + P\Xi,l + p2Xi,2 + + PpXi^p + £i (4) 
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where yt are the predictands, x,^ p the predictors, fi p the regression coefhcients, 
and e.j is the error term. The regression coefficients are the solution to the least 
squares estimation: 


P = [X T X)~ 1 X T y i (5) 

where X is the matrix of regressor variables. The Matlab function LinearModel, 
from the Statistics Toolbox (v8.2), was used to perform the stepwise regression 
using the polynomial form described in eq. [4] The starting model included the 
intercept, linear terms, interactions and power terms with interactions up to 
a factor of 4 for each explanatory variable. The algorithm uses forward and 
backward regression with the Sum of Squared Error (SSE) to add and remove 
terms from the model based on the p-values of the F-statistic with and without 
a potential term. The thresholds for adding and removing terms were p < 0.05 
and p > 0.1, respectively. 

A conventional method for assessing goodness-of-fit for linear regression 
models is to calculate the R 2 value or coefficient of determination. As the com¬ 
plexity of the polynomial increases (by adding more variables), the R 2 value 
will increase, which may result in a skewed confidence in the reliability of the 
model. The adjusted- R 2 value is used to address this issue: 


c>2 

adj 


= l - 


^J2(vi-y) 2 


){n 


n — p — 1 


1 ) 


( 6 ) 


where y t , y and y z are the observed data, mean of the observed data and mod¬ 
elled predictands, respectively, n is the number of observations and p the number 
of regressors. 

Parallel speedup and efficiency were calculated according to Amdahl’s Law 
for AmpliconNoise based denoising of the MFC sample, as this step is observed 
to be the most computationally intensive part of the pipeline and is compiled 
to run in parallel. Denoising was performed using 1, 2, 4, 8, 16, 32, 64 and 128 
physical cores, with 48GB RAM per node (six cores per node) on the DIAG 
resource |6 for two sub-samples of the total sequences (20000 and 40000 reads). 


3 Results and discussion 

3.1 Model development 

Performance modelling was initiated on the two test datasets using the stan¬ 
dard QIIME pipeline tools for processing and analysis of 16S rRNA sequencing 
data. The wall clock (Real) and CPU time were recorded for each pipeline step 
and a model fit between these variables and input read number was made. For 
the denoising algorithms (QIIME denoiser and AmpliconNoise) a Multiple Lin¬ 
ear Regression (MLR) model was deemed necessary as visualisation of the read 
number versus clock time highlighted that more than one explanatory variable 
was influencing the denoising time. Community diversity within a sample, or 




more specifically taxonomic rank-abundance, is known to influence computa¬ 


tional effort of denoising 29 


The equitability, or evenness, defines the homogeneity of species within a 
community, with higher values of the index indicating a highly even or homo¬ 
geneous distribution of species. This diversity metric was used as a second 
explanatory variable in the MLR model as a candidate for determining cluster¬ 
ing speed during denoising. Despite diversity calculation being a component of 
the QIIME pipeline, the parameter values used for the model were generated by 
calculating equitability from the raw input fasta files, prior to quality checking 
and noise removal. Although the value is not a true measure of diversity due 
to the presence of errors, it is assumed the error profile across all samples is 
equivalent and this will not affect the model. 


3.2 Modelling sequence denoising strategies 

3.2.1 Training 

Denoising of high-throughput sequencing data with the most commonly used al¬ 
gorithms (QIIME clenoiser and AmpliconNoise) is clearly the major performance 
bottleneck in the analysis pipeline, but also one of the most critical in terms of 
determination of more accurate OTU numbers and subsequent classification. 

A MLR model was developed with two explanatory variables (number of 
reads (A) and sample equitability (a)) as predictors and wall clock time (y) as 
the response variable. 

By simply observing the relationship between the explanatory and response 
variables, it is evident that a non-linear implementation of the MLR model is 
necessary. The introduction of power terms in the model is intended to reflect 
the curvlinear nature of the underlying dependencies. 

Stepwise MLR models were developed using the three training datasets for 
the QIIME denoiser (Eq. [7]) and AmpliconNoise algorithms (Eq. [8]). The mod¬ 
els take the form given by equation [4] with non-linear power terms and cross- 
products between the two predictor variables. 

Vqd = fiiQ. + factX + /?3A” + A + + /3 6 A 3 + fi’jcP'}? + /3saA 3 (7) 

The regression coefficients (/3) are shown in Table [2j The results from the 
QIIME denoiser model suggest a conformity between the two explanatory vari¬ 
ables selected and the resulting predictand. Fig. [T] shows excellent prediction 
(Adjusted R 2 > 0.9) for all training data, which is confirmed by performing 
an ANOVA on the full model (F-statistic = 7.38xl0 3 , p-value = 2.54xl0 -95 ) 
indicating that the non-linear model is highly significant. All plots are shown 
in relation to equitability for ease of visualisation, however, an example 3D 
plot (embedded in the uppermost plot in Fig. |T|) for the MFC data shows the 
excellent fit against both explanatory variables. 

For AmpliconNoise, an initial two parameter training model produced good 
fits for two datasets (Arctic and MFC), but could not fit the Priest Pot data. 
It was surmised that read length could be an additional factor in determining 


9 




• •' - 

~ 1 1 





Adjusted R 2 = 0.9428 


• PP observed 

• PP predicted 
95% Cl 

* • . 







- 








- 


* 






- 

- 

• 

"T- 





- 

■ 



_ 





0.64 0.66 


0.68 


0.7 

0.72 

0.74 



1600 
1400 
g 1200 
| 1000 

i 800 

o 600 
400 
200 



Equitability 

Figure 1: QIIME denoiser model performance for the three training datasets; 
fit plotted along the equitability axis. Two parameter fit is shown in the 3D 
embedded figure 


processing time during the sequencing error removal ( Seqnoise ) step, given the 
importance of sequence size in influencing error rate distribution 9|. Including 
mean read length per sample (p) as a third parameter in the model decreased 
the Root Mean Square Error of Calibration (RMSEC) from 5750 to 129 and the 
improvement can also be seen in Fig. [2] Although prediction is not as convincing 
as with the QIIME denoiser data, the model, shown in equation[8j is still highly 
significant (F-statistic = 1.60xl0 3 , p-value = 2.60xl0 -51 ). 

Van = Pi p + /3 2 A + p 30 i\ + P 4 X 2 + P^cxp + PepX + PtP~ 

+ Ps a X 2 + /3gA 3 + /3io<aA 3 (8) 


3.2.2 Validation 

Validation of regression based models is critical to ascertain their ability to be 
used with independent data. In complex, highly parameterised models, there is 
a risk that overfitting of the data may occur, in which the model tends to fit to 
the training data, but lacks the predictive capacity when fitting to validation 
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Figure 2: AmpliconNoise model performance for the three training datasets; fit 
plotted along the equitability axis 


or real-world datasets. The models developed for both denoising algorithms 
were tested with the independent validation datasets to assess their suitability 
for prediction of processing time. For both denoising algorithms the models 
fit the Mixed sediment dataset well, with adjusted R 2 values of 0.97 and 0.72 
for QIIME denoiser and AmpliconNoise, respectively. However, the models do 
not predict the Artificial data (See Fig. [3]), which suggests that there is some 
underlying property of the artificially generated sequence communities that is 
not captured during training. The dataset was constructed to represent a com¬ 
munity with log-normal distribution analogous to true community distributions 
found in the environment 26 , and it is possible that this artificial construct 


has presented some feature in the data that has a significant impact on de¬ 
noising performance. As discussed, the Priest Pot data was acquired from the 
same sequencing technology and the inclusion of the mean read length in the 
AmpliconNoise model had significant impact on training performance. 

A main effects plot, shown in Fig. | A2| (Appendix [A| , was generated to look 
at the contributions from each independent model variable on the clock time. 
For both the two and three variable models, the number of reads was the largest 
contributing factor to the output variability. It can also be seen that, although 
mean read length has a small impact on the AmpliconNoise model, justifying 
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Figure 3: QIIME denoiser model prediction for the two validation datasets; fit 
plotted along the equitability axis 


its inclusion, it is not the critical factor in explaining the poor performance with 
the Artificial dataset. 

3.3 General pipeline model 

The pipeline algorithms deployed sequentially without parallelism (i.e. on a sin¬ 
gle CPU core) generally contribute insignificant burden to the overall processing 
time compared to the denoising step. However, based on system memory avail¬ 
ability and CPU processor speed, scripts related to sequence alignment, OTU 
picking, taxonomy assignment and diversity calculation, may become cumber¬ 
some, especially for large sequencing runs. QIIME includes several parallel 
commands to handle such conditions, but in this study the test environment 
and dataset sizes were such that single CPU processing was sufficient. A MLR 
model was developed using the total wall clock time measured for all analysis 
steps independent of denoising, as shown in Table |A2| Although non-continuous, 
as quality filtering occurs prior to denoising, whereas all other steps are down¬ 
stream, the intention is to indicate a relationship between processing time and 
predictor variables that will aid resource allocation. Additionally, it should be 
noted that the predictor variables will undergo changes during the pipeline as 
reads are removed, trimmed and truncated, particularly at the upstream end of 
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the pipeline. However, monitoring changes in read number and diversity is im¬ 
practical, especially with automated pipelines and given the underlying aim of 
assessing resource requirements a priori. It is assumed that the underlying cor¬ 
relation between predictors and response hold true for the agglomerated model 
as the relative changes in predictor variables are expected to be uniform across 
all training samples for any given pipeline. 

Observation of individual pipeline step results indicated generally linear re¬ 
lationships between number of reads and clock time but, as with the denoising 
algorithms, additional confounding factors appeared to have a role in determin¬ 
ing response across datasets. An initial model was developed using number of 
reads and equitability, which gave satisfactory training results but produced 
poor fitting for the Artificial validation data (Adjusted R 2 of 0.333). A three 
factor model that included mean read length was investigated and greatly im¬ 
proved prediction of the articitial validation dataset (Adjusted R 2 of 0.704), 
but with a slight decrease in the model fit for the Mixed dataset (Adjusted 
R 2 reduced from 0.914 to 0.841). The final pipeline MLR model (F-statistic 
= 1.27 xlO 3 , p-value = 5.43x10 53 ) is shown in equation[9]and the regression 
coefficients in Table [2j with the fitted training and prediction curves presented 
in Figs. [4] and [5j respectively. 

Upipe = /3 0 + A A + P 2 OL + /?3 p + /3±a\ + f3§A 2 + f3 9 ap + fUjA p + fi%p 2 

+ p 9 a\ 2 (9) 


3.4 Speedup and Efficiency 

Due to the unfeasibly long processing time for AmpliconNoise and, to a lesser 
extent, QIIME denoiser, there may be a tendency to invest heavily in high- 
performance computing solutions to dramatically reduce the run time. Whilst 
there is some guidance on memory requirements for running the algorithms 
(1GB for FLX sequences with QIIME denoiser 23 , > 8GB for large datasets 


when running AmpliconNoise |24| ), there may be a tendency for employing as 
large a resource as economically and logistically feasible. Aside from the costs 
involved, acquisition of hardware for long-term use may result in redundancy 
unless demand is significant. Deployment on decentralised systems may also 
result in conflict if non-essential capacity is being utilised for the denoising 
task. 

The results of the speedup and efficiency tests performed with Amplicon¬ 
Noise on the decentralised DIAG [61 resource are shown in Fig. [6] The speedup 
plot shows that actual performance improvement is far from ideal when utilis¬ 
ing more processors, reaching a threshold of approximately 7.5 times speedup 
with 128 cores, which corresponds to 5% efficiency shown in the second plot. 
AmpliconNoise is clearly not a massively parallel algorithm, with many serial 
components contributing to the dramatic reduction in efficiency with greater 
parallel resources. 
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No. Reads 


Figure 4: General pipeline model performance for the three training datasets; 
fit plotted along the read number axis 


Table 2: Regression Coefficients for the MLR Models of the QIIME Denoiser 
and AmpliconNoise Algorithms, and General QIIME Pipeline Steps 


Regress, coeff. 

QIIME 

Ampliconnoise 

Pipeline 

Po 

0 

0 

2820.200 

Pi 

1.718 

-47.687 

-0.012 

P 2 

-5.360 

1.709 

556.420 

p3 

l.lxlO" 4 

-7.098 

-19.001 

Pa 

4.096 

2 .2xKT 4 

0.045 

p5 

-4.6x 10” 4 

127.790 

2 .8x10"' 

P 6 

6.2xlO" 10 

0.017 

-1.640 

P7 

4.7x 10" 4 

-0.184 

-2.1xl0" 5 

Ps 

-1.4xl0" y 

-5.2x 10” 4 

0.032 

P9 


-1.6x 10~ 8 

-5.7x10"' 

PlO 


3.Ox 10~ 8 



Based on the analysis and considering the tradeoff between time to comple¬ 
tion and resource utilisation/expenditure, between 8 and 16 (shown as a vertical 
bar in Fig. [6]) cores appear reasonable for this architecture. Using more than 
16 cores does not deliver significant increases in speedup, whilst efficiency drops 
below 30%. 
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Figure 5: General pipeline model prediction for the two validation datasets; fit 
plotted along the read number axis 


4 Conclusion 


The QIIME software package is widely employed for microbial community anal¬ 
ysis using data derived from a range of NGS technologies. The protocols defined 


by 16 offer a standard methodology for processing sequencing reads and trans¬ 


forming that data into interpretable information such a sample diversity and 
phylogenetic distances. The benefits of using QIIME for downstream analysis of 
NGS data is its ability to pipeline a range of bioinformatics steps in a consistent 
and reproducible manner, several parallelised scripts for data intensive process¬ 
ing and its portability across a range of high-throughput and high-performance 
environments such as Amazon Cloud, Virtual Box/CloVR and Grid services 
such as DIAG. 

When considering what resources to utilise for post-sequencing analysis, it 
is important to have an understanding of computational requirements for the 
tools employed. This is vital when considering deploying algorithms on resources 
that provide a service at a cost related to time utilised. However, it may also 
be important for users requiring rapid turnaround from sequencing to informa¬ 
tion, those wishing to optimise a pipeline or invest in additional computational 
resources, and in cases where the resource is utilised by multiple analysts. 

The work presented demonstrates that there exists a significant relationship 
between characteristics of the input sequencing data and the computational time 
required to process that data. Although the models developed are characteristic 
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No. Processors 


Figure 6: Speedup and Efficiency metrics for subsampled MFC data (20000 and 
40000 reads) across multiple cores on the DIAG resource using AmpliconNoise 


of the system that they were calibrated against, the MLR modelling technique 
can be applied to generate models for any architecture. 

Two and three variable models developed for sequence denoising algorithms 
were successful in predicting time for completion, but suffered with the Artificial 
data. This was potentially due to the nature of the dataset construction, which 
was not captured in the model development, which consisted entirely of sam¬ 
ples from naturally distributed communities. However, a three variable general 
pipeline model predicted total time for completion of the standard QIIME anal¬ 
yses with 6% and 7% error for the Mixed and Artificial datasets, respectively. 

There is often a temptation when working with computationally intensive 
algorithms to allocate the maximum amount of resource to the problem without 
considering if this will be optimal. Parallel speedup and efficiency analysis for 
the AmpliconNoise algorithm revealed that denoising with increasing number 
of CPUs is far from ideal. The efficiency of the computation decreases rapidly 
beyond a single core processor, whilst speedup is not significantly increased 
beyond 32 cores and the analysis suggests the use of between 8 and 16 processors 
is sufficient under the test architecture. 

Accurate modelling of the relationship between input sequence data and time 


16 



























to completion has shown to be a viable method for supporting data analysts in 
decision-making related to resource allocation. Although restricted to a single 
architecture and set of tasks, the methodology can and should be applied in 
a broader context to understand if the approach is transferable across a range 
of applications (e.g. read mapping, genome assembly). A standardised model 
may be idealistic, but the modelling effort is minimal when contrasted with the 
expected data throughput that is beginning to emerge in sequencing today. As 
sequencing technology is evolving at a remarkable rate, it will be useful to look 
at assessing the methodology presented here against more diverse data sets, in 
terms of size and source environment. This should be coupled with attempts 
to identify how generic the models are across different system architectures and 
platforms. 

It is suggested that by developing techniques and tools for modelling com¬ 
putational requirements for bioinformatics analysis, coupled with methods for 
estimating sequencing effort required to generate the necessary depth of infor¬ 
mation (see 25 ), then analysts can be armed with the capabilities to compre¬ 


hensively plan their research efforts, assess resource requirements and funding 
needs. 


List of abbreviations 

MFC Microbial Fuel Cell 

MLR Multiple Linear Regression 

OTU Operational Taxonomic Unit 

PCR Polymerase Chain Reaction 

RMSEC Root Mean Square Error of Calibration 
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Appendix A Supplemental Material 


Performance pipeline 




JT® * 

d-7 Sp//f Fas/a 
Pre-processing Sp/mSFF 


Mutli-core system 


Quality filtering 
Demultiplexing pooled samples 




Denoising 


Amplicon Noise 


QIIME denoiser 


SL¬ 


OW) Picking 


Clustering into OTUs 
Representative seqs 



--N 


_sz_ 


Taxonomy Assignment! 


RDP Classifier 

Use reference database 


Greengenes DB 


1 Sequence Alignment 

L Pm-aligned Template . _K wi sequences^ 
Sequences ^ 


Filter alignment 


_£ 



Chimera Checking 

Identify chimeras 

Filter FASTAfile 

1_ 




i. 


Calculate a Diversity 


Calculate 0 Diversity 

Rarefy OTU table 
Compute a diversity 
Collate rarefied tables 
Make rarefaction curves 


Rarefy OTU table 

Compute 0 diversity 

Generate principal coordinates 
Make PCoA plots 


Figure Al: The general analysis pipeline utilised within QIIME for modelling 
performance 
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Table Al: Pipeline Parameters for Relevant Analysis Steps 


Script 

Parameters 

Flag 

Value 

Split libraries 

min seq length 
max seq length 

-1 

-L 

150 

1000 

AmpliconNoise 

seqnoise resol. 

-s 

25 (Titanium) 

30 (FLX) 

Pick OTUs 

method 

similarity 

-m 

-s 

uclust 

0.97 

Representative 

set 

method 

sorting 

-m 

-s 

first (dust, seed) 
otu 

Taxonomy 

assignment 

method 

confidence 

-m 

-c 

rdp 

0.8 

Alignment 

method 

pairwise method 
min percent ID 
min length 

-m 

-a 

-P 

-e 

pynast 

uclust 

75 

150 

Identify chimeras 

method 
fragments 
taxonomy depth 

-m 

-n 

-d 

chimera slayer 

3 

4 

Filter alignment 

allowed gap frac. 
threshold 

-g 

-t 

0.999999 

3 

Make phylo. 
tree 

method 

-t 

fasttree 


Table A2: Pipeline Script Inputs According to Type. Representative Sequences 
are Equivalent to OTUs After Clustering 


Script 

Input type 

Quality filtering 

Raw sequences 

Denoising 

Trimmed sequences 

OTU picking 

Denoised sequences 

Taxonomy assignment 

Representative sequences 

Alignment 

Representative sequences 

Chimera removal 

Representative sequences 

Phylogeny 

Chimera free OTUs 

OTU table 

OTUs & taxonomy assignments 

Diversity 

OTU table & phylogenetic tree 
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xIO 4 


Equability: 0.61358 to 0. 


Reads: 812 to 54002 


Length: 244.5878 to 401.2671 



Equitability: 0.48346 to 0.91134 


Reads: 878.5 to 47920.67 


Length: 322.7001 to 400.98 



Figure A2: Main effects plots for the three models 
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